fix(ci): re-run only failed specs on Windows test retries (PER-9011)#2267
Conversation
Windows @percy/core CI took ~241 min vs ~19 min on Linux. The cause was not a single slow run: ~1 of 1078 node specs flakes per run on Windows (browser/local-server timing), and windows.yml retried the entire ~60-min suite up to 4x, with continue-on-error masking the failures as green. Make retries cheap instead of fighting every flaky spec: - scripts/test.js records failed spec names to PERCY_NODE_FAILURES_FILE and, when PERCY_ONLY_FAILED_SPECS=1, uses a jasmine specFilter to re-run only those specs. A filtered run reports "incomplete" (rest skipped), so success there means the targeted specs passed; full runs keep jasmine's strict status. No behavior change when the env vars are unset (Linux/local). - windows.yml: retry0 runs the full suite; retries 1-4 re-run only the specs that just failed (seconds, not ~60 min). - Add a ::warning:: when a retry recovers a flake so "green via retry" stays visible. Verified locally against the real @percy/core suite: a 28-spec failure set re-ran in 12s vs 1304s for the full 1078-spec suite. Expected on Windows: ~241 min -> ~62 min, full coverage kept. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
cli-doctor's env-audit check flags any process.env key starting with PERCY_, so injecting PERCY_NODE_FAILURES_FILE at the job level broke its "no Percy vars are set" specs on Windows CI. Rename to CLI_TEST_FAILURES_FILE and CLI_TEST_ONLY_FAILED (non-PERCY_) so the retry plumbing stays invisible to Percy-env detection. Verified locally: cli-doctor 508/508 pass with the renamed vars (both specs fail when a PERCY_-prefixed var is injected, confirming the cause); spec-level retry behavior unchanged. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
| # specs that flaked instead of the whole ~60-min suite. See PER-9011. | ||
| # Name is deliberately non-PERCY_ so it doesn't trip env-audit tests. | ||
| env: | ||
| CLI_TEST_FAILURES_FILE: ${{ github.workspace }}/.cli-test-failures.json |
There was a problem hiding this comment.
why only implement for windows?, lets do for linux as well will help in improvement overall
There was a problem hiding this comment.
Good call — done in 9c81399. The runner support in scripts/test.js was already platform-agnostic; only the wiring was Windows-only, so I've extended it to test.yml (Linux) too. Linux flakes on the same timing-sensitive specs and had no retry at all.
One Linux-specific nuance worth flagging: Linux runs test:coverage with a 100% coverage gate (.nycrc), and a retry that re-runs only a few specs can't reach 100%. So I structured it as:
- retry0 →
test:coverage(full suite + coverage gate, records any failed specs) - retries 1–4 → plain
teston only the failed specs, without the coverage gate (a subset can't hit 100%) - if retry0 fails with no recorded spec failures (i.e. a genuine coverage drop or a crash, not a flaky spec), the runner now preserves that failure instead of masking it — so real coverage regressions still fail the build.
Verified locally before pushing: empty-failures retry → exit 1 (preserved); real-failure retry → runs only that spec → exit 0; and the coverage path correctly writes the failures file the retries consume. CI is re-running now.
Per review feedback: the spec-level retry was only wired into windows.yml, but the runner support in scripts/test.js is platform-agnostic. Wire it into test.yml (Linux) too — Linux flakes on the same timing-sensitive specs and has no retry today. Linux runs test:coverage with a 100% coverage gate (.nycrc), which needs care: - retry0 runs `test:coverage` (full suite + coverage gate, records failed specs) - retries 1-4 run plain `test` on ONLY the failed specs, WITHOUT coverage (a subset can't hit the 100% threshold) - if retry0 fails with no recorded spec failures (a real coverage drop or a crash, not a flaky spec), the runner now preserves the failure instead of masking it Verified locally: empty-failures retry -> exit 1 (preserved); real-failure retry -> runs only that spec, exit 0; the coverage path writes the failures file that retries consume. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Problem
The Windows
Test @percy/corecheck takes ~241 min, vs ~19 min for the same check on Linux (Linux even runs with coverage). This slows every PR's required Windows checks. Tracked in PER-9011.Root cause
The 241 min is not one slow run — it's the same
@percy/coresuite run 4 times by the retry loop in.github/workflows/windows.yml:Run testsRetry (1/4)Retry (2/4)Retry (3/4)continue-on-error: truerewrites each failed step'sconclusiontosuccess, so the check looks green even though 3 of 4 attempts failed. Only ~1 spec out of 1078 flakes per run (browser/local-server timing on Windows) — and it's a different spec each time (e.g.Discovery > captures favicon when the server provides one,... promise only for sync snapshot). Re-running the entire ~60-min suite to recover one flaky spec is the real waste.Fix: make retries cheap instead of fighting every flaky spec
scripts/test.js— the Jasmine node runner records failed spec names toPERCY_NODE_FAILURES_FILE; when run withPERCY_ONLY_FAILED_SPECS=1it uses aspecFilterto re-run only those specs. A filtered run reportsincomplete(the rest are intentionally skipped), so success there means the targeted specs passed; full runs keep Jasmine's strict status. No behavior change when the env vars are unset (Linux/local/coverage all unaffected)..github/workflows/windows.yml—retry0runs the full suite once; retries 1-4 setPERCY_ONLY_FAILED_SPECS=1so they re-run only the spec(s) that just failed (seconds, not ~60 min).Flag flaky testsstep emits a::warning::whenever a retry recovered a flake, so "green via retry" stays visible.Verification (local, against the real
@percy/coresuite)incomplete, so the exit logic now treats "targeted specs passed" as success (otherwise every retry would have looked like a failure).Expected impact
~241 min → ~62 min for
@percy/coreon Windows (one full pass + cheap retries), full coverage preserved.Draft until Windows CI confirms the timing.
🤖 Generated with Claude Code